Learning Relations Using Collocations

Authors

  • Gerhard Heyer
  • Martin Läuter
  • Uwe Quasthoff
  • Thomas Wittig
  • Christian Wolff
Abstract

This paper describes the application of statistical analysis of large corpora to the problem of extracting semantic relations from unstructured text. We regard this approach as a viable method for generating input for the construction of ontologies, since ontologies use well-defined semantic relations as building blocks (cf. van der Vet & Mars 1998). Starting from a short description of our corpora and our language analysis tools, we discuss in depth the automatic generation of collocation sets. We then give examples of the different types of relations that may be found in collocation sets for arbitrary terms. The central question we address is how to postprocess statistically generated collocation sets in order to extract named relations. We show that for different types of relations, such as cohyponyms or instance-of relations, different extraction methods as well as additional sources of information can be applied to the basic collocation sets in order to verify the existence of a specific type of semantic relation for a given set of terms.

1 Analysis of Large Text Corpora

Corpus linguistics is generally understood as a branch of computational linguistics dealing with large text corpora for the purpose of statistical processing of language data (cf. Armstrong 1993, Manning & Schütze 1999). With the availability of large text corpora and the success of robust corpus processing in the nineties, this approach has recently become increasingly popular among computational linguists (cf. Sinclair 1991, Svartvik 1992). Since 1995, a German text corpus of more than 300 million words has been collected (cf. Quasthoff 1998B, Quasthoff & Wolff 2000). It contains approx. 6 million different word forms in approx. 13 million sentences and serves as input for the analysis methods described below.
Similarly structured corpora have recently been set up for other European languages as well (English, French, Dutch), with more languages to follow in the near future (see Table 1).

              German    English   Dutch     French
word tokens   300 M     250 M     22 M      15 M
sentences     13.4 M    13 M      1.5 M     860,000
word types    6 M       1.2 M     600,000   230,000

Table 1: Basic Characteristics of the Corpora

The basic goal of this corpus-based approach is to collect large amounts of textual data as input for semantic processing. Starting from a rather simple data model, tailored for large amounts of data and efficient processing and stored in a relational database system, we employ a simple yet powerful technical infrastructure for processing texts to be included in the corpus. Besides basic procedures for integrating text into the corpus, various tools have been developed for post-processing linguistic data. Among them, the automatic calculation of sentence-based word collocations stands out as an especially valuable tool for corpus-based language technology applications (see Quasthoff 1998A, Quasthoff & Wolff 2000). Additional, application-oriented tools exist for search engine optimization as well as automatic document classification (see Heyer, Quasthoff & Wolff 2000). The corpora are available on the WWW (http://www.wortschatz.uni-leipzig.de) and may be used as a large online dictionary.
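
The idea of sentence-based word collocations mentioned above can be illustrated with a small sketch. It counts how often two word forms occur in the same sentence and scores each pair with pointwise mutual information (PMI). Both the function name `sentence_collocations` and the choice of PMI are assumptions made for illustration; this section does not state which significance measure the authors use, and whitespace tokenization is a simplification.

```python
import math
from collections import Counter
from itertools import combinations


def sentence_collocations(sentences, min_count=2):
    """Score word pairs that co-occur within a sentence.

    Hypothetical helper: PMI is an assumed stand-in for the
    significance measure, which this section does not specify.
    """
    n = len(sentences)
    word_freq = Counter()  # number of sentences containing each word
    pair_freq = Counter()  # number of sentences containing both words
    for sentence in sentences:
        # Count each word form at most once per sentence.
        words = sorted(set(sentence.lower().split()))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))

    scores = {}
    for (a, b), k in pair_freq.items():
        if k >= min_count:
            # PMI = log2( P(a, b) / (P(a) * P(b)) ), sentence-level estimates.
            scores[(a, b)] = math.log2(k * n / (word_freq[a] * word_freq[b]))
    return scores


corpus = ["the cat sat", "the cat ran", "a dog ran", "a dog sat"]
for pair, score in sorted(sentence_collocations(corpus).items()):
    print(pair, score)
```

On a corpus of realistic size, the per-sentence counts would be kept in a database rather than in memory, in line with the relational storage described above.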

Publication date: 2001